Simpson's Paradox

Use admission_data.csv for this exercise.

In [1]:
# Load and view first few lines of dataset
import pandas as pd
import numpy as np

df = pd.read_csv('admission_data.csv')
df.head()
Out[1]:
student_id gender major admitted
0 35377 female Chemistry False
1 56105 male Physics True
2 31441 female Chemistry False
3 51765 male Physics True
4 53714 female Physics True

Proportion and admission rate for each gender

In [2]:
# Proportion of students that are female
len(df[df['gender'] == 'female'])/df.shape[0]
Out[2]:
0.514
In [3]:
# Proportion of students that are male
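# Computed directly rather than with `1 - _`, which breaks if cells are run out of order
(df['gender'] == 'male').mean()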
Out[3]:
0.486
In [4]:
# Admission rate for females
df[df['gender'] == 'female']['admitted'].mean() 
Out[4]:
0.28793774319066145
In [5]:
# Admission rate for males
df[df['gender'] == 'male']['admitted'].mean()  # overall, females are admitted at a lower rate than males
Out[5]:
0.48559670781893005

Proportion and admission rate for physics majors of each gender

In [9]:
# What proportion of female students are majoring in physics?

# Given that a student is female, what is the probability that they major in physics?
# P(physics | female) = P(female and physics) / P(female).
# Both proportions share the same denominator (the total number of students),
# so the ratio of the raw counts gives the same result.

len(df.query('gender == "female" and major == "Physics"')) / len(df[df['gender'] == 'female'])
Out[9]:
0.12062256809338522
In [13]:
# What proportion of male students are majoring in physics?

len(df.query('gender == "male" and major == "Physics"')) / len(df[df['gender'] == 'male'])  # a much larger share of males major in physics
Out[13]:
0.92592592592592593
In [16]:
# Admission rate for female physics majors

# That is, the proportion of female physics applicants who are admitted
fem_adm_phys = len(df.query('gender == "female" and major == "Physics" and admitted == True'))
fem_phys = len(df.query('gender == "female" and major == "Physics"'))

fem_adm_phys/fem_phys
Out[16]:
0.74193548387096775
In [17]:
# Admission rate for male physics majors

# That is, the proportion of male physics applicants who are admitted
male_adm_phys = len(df.query('gender == "male" and major == "Physics" and admitted == True'))
male_phys = len(df.query('gender == "male" and major == "Physics"'))

male_adm_phys / male_phys  # in physics, the female admission rate is higher
Out[17]:
0.51555555555555554

Proportion and admission rate for chemistry majors of each gender

In [18]:
# What proportion of female students are majoring in chemistry?
len(df.query('gender == "female" and major == "Chemistry"')) / len(df[df['gender'] == 'female'])
Out[18]:
0.87937743190661477
In [19]:
# What proportion of male students are majoring in chemistry?
len(df.query('gender == "male" and major == "Chemistry"')) / len(df[df['gender'] == 'male'])  # a far smaller share of males major in chemistry
Out[19]:
0.07407407407407407
In [23]:
# Admission rate for female chemistry majors
fem_adm_chem = len(df.query('gender == "female" and major == "Chemistry" and admitted == True'))
fem_chem = len(df.query('gender == "female" and major == "Chemistry"'))

fem_adm_chem/fem_chem
Out[23]:
0.22566371681415928
In [24]:
# Admission rate for male chemistry majors
male_adm_chem = len(df.query('gender == "male" and major == "Chemistry" and admitted == True'))
male_chem = len(df.query('gender == "male" and major == "Chemistry"'))

male_adm_chem / male_chem  # the male admission rate is lower in chemistry as well as in physics
Out[24]:
0.1111111111111111

Admission rate for each major

In [25]:
# Admission rate for physics majors
df[df['major'] == "Physics"]['admitted'].mean()
Out[25]:
0.54296875
In [26]:
# Admission rate for chemistry majors
df[df['major'] == "Chemistry"]['admitted'].mean()
Out[26]:
0.21721311475409835
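
The per-cell queries above can also be collected into a few summary tables. The sketch below is only illustrative: since admission_data.csv is not bundled here, it builds a hypothetical stand-in frame named demo from the smallest applicant counts that are consistent with the proportions printed above. Those counts are an assumption inferred from the outputs, not values read from the real file. With the real data, the same groupby and crosstab calls should work on df.

```python
import pandas as pd

# Hypothetical stand-in for admission_data.csv: the smallest counts consistent
# with the proportions printed above (an assumption, not the real file).
counts = [
    # (gender, major, admitted, n_students)
    ("female", "Physics", True, 23),
    ("female", "Physics", False, 8),
    ("female", "Chemistry", True, 51),
    ("female", "Chemistry", False, 175),
    ("male", "Physics", True, 116),
    ("male", "Physics", False, 109),
    ("male", "Chemistry", True, 2),
    ("male", "Chemistry", False, 16),
]
rows = [(g, m, a) for g, m, a, n in counts for _ in range(n)]
demo = pd.DataFrame(rows, columns=["gender", "major", "admitted"])

# P(admitted | gender, major): one row per gender, one column per major
rate_by_major = demo.groupby(["gender", "major"])["admitted"].mean().unstack()

# P(major | gender): each row sums to 1
major_share = pd.crosstab(demo["gender"], demo["major"], normalize="index")

# P(admitted | gender), ignoring major
overall = demo.groupby("gender")["admitted"].mean()

print(rate_by_major.round(3))
print(major_share.round(3))
print(overall.round(3))
```

Reading rate_by_major column by column gives the within-major comparison, while overall gives the aggregate comparison; major_share shows how differently the two genders are distributed across majors.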

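Far more females applied to chemistry (about 88% of female students, versus about 7% of male students), and chemistry had the lower admission rate (about 22%, versus about 54% for physics). As a result, females had a lower overall admission rate (about 29%, versus about 49% for males), even though within each major they were admitted at a higher rate than males: about 74% versus 52% in physics, and about 23% versus 11% in chemistry. This reversal of a trend when groups are combined is known as Simpson's Paradox.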
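
One way to see why the aggregate comparison flips is to write each gender's overall rate as a weighted average of its per-major rates, with the weights given by how that gender is split across majors (the law of total probability). The sketch below copies the figures from Out[9] through Out[24]; the names share, rate, and overall_rate are introduced here purely for illustration.

```python
# Values copied from the notebook outputs above (Out[9] through Out[24]).
# share[g][m] = P(major m | gender g); rate[g][m] = P(admitted | gender g, major m)
share = {
    "female": {"Physics": 0.12062256809338522, "Chemistry": 0.87937743190661477},
    "male": {"Physics": 0.92592592592592593, "Chemistry": 0.07407407407407407},
}
rate = {
    "female": {"Physics": 0.74193548387096775, "Chemistry": 0.22566371681415928},
    "male": {"Physics": 0.51555555555555554, "Chemistry": 0.1111111111111111},
}


def overall_rate(gender):
    """Sum over majors of P(major | gender) * P(admitted | gender, major)."""
    return sum(share[gender][m] * rate[gender][m] for m in share[gender])


overall = {g: overall_rate(g) for g in share}
for g, r in overall.items():
    print(f"{g}: {r:.4f}")
```

The results should match Out[4] and Out[5]. The female average is dominated by the low chemistry rate (weight of about 0.88), while the male average is dominated by the higher physics rate (weight of about 0.93), which is enough to reverse the within-major ordering.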
